DOCUMENT RESUME 



ED 332 636                                        HE 024 615

AUTHOR          Klitzner, Michael; Stewart, Kathryn
TITLE           Evaluating Faculty Development and Clinical Training
                Programs in Substance Abuse: A Guide Book.
INSTITUTION     Pacific Inst. for Research and Evaluation, Walnut
                Creek, CA.
SPONS AGENCY    National Inst. on Alcohol Abuse and Alcoholism
                (DHHS), Rockville, Md.; National Inst. on Drug Abuse
                (DHHS/PHS), Rockville, Md.
REPORT NO       RP0778
PUB DATE        Jun 90
NOTE            33p.
PUB TYPE        Guides - Non-Classroom Use (055) -- Reports -
                Descriptive (141)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     Alcohol Abuse; Allied Health Occupations Education;
                *Clinical Teaching (Health Professions); Drug Abuse;
                *Evaluation Methods; *Faculty Development; Graduate
                Medical Education; Higher Education; Medical
                Education; Medical School Faculty; Mental Health;
                *Program Evaluation; Qualitative Research; *Research
                Methodology; Sampling; Statistical Bias; *Substance
                Abuse



ABSTRACT 

Intended to provide an overview of program evaluation 
as it applies to the evaluation of faculty development and clinical 
training programs in substance abuse for health and mental health 
professional schools, this guide enables program developers and other 
faculty to work as partners with evaluators in the development of 
evaluation designs that meet the specialized needs of faculty 
development and clinical training programs. Section I discusses 
conceptual issues in program evaluation, including the uses of 
evaluation (management and monitoring, program description, program 
improvement, accountability, and creating new knowledge); the major 
options (formative/summative, process/outcome/impact, 
quantitative/qualitative); and the benefits and risks of conducting 
evaluation studies. Section II, an introduction to research methods, 
includes the following discussions: sampling with known sampling 
errors (simple random, systematic, multistage random, stratified, 
cluster, stratified cluster, and sequential sampling); sampling 
without known sampling errors (convenience, quota, modal, purposive, 
and snowball sampling); sample size and generalizability and sample 
size and statistical power; the validity of evaluations and potential 
sources of bias, including issues related to internal validity 
(history, maturation, testing, instrumentation, statistical 
regression, selection, mortality, interactions with selection, and 
ambiguity about the direction of causal influence); comparison and 
control groups; measurement of outcomes; and qualitative evaluation 
methods and analysis. (5 references) (JB) 






EVALUATING 
FACULTY DEVELOPMENT AND 
CLINICAL TRAINING PROGRAMS 
IN SUBSTANCE ABUSE: 
A GUIDE BOOK 

Michael Klitzner, Ph.D. 
Kathryn Stewart, M.S. 

Pacific Institute for Research and Evaluation 
Bethesda, Maryland 

June, 1990 



Development of this Guide was supported by the National 
Institute on Alcohol Abuse and Alcoholism and the National 
Institute on Drug Abuse 




ACKNOWLEDGMENTS 

This Guide Book is based on materials originally assembled for A Guide to First DUI Offender Program 
Evaluation prepared by the authors for the California Office of Traffic Safety. The authors wish to 
acknowledge the contribution of Paul Moberg, Ph.D., with whom the authors originally developed the 
approach to program evaluation reflected in this Guide Book. 






TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
ABOUT THIS GUIDE BOOK

SECTION I: ISSUES IN PROGRAM EVALUATION

  USES OF PROGRAM EVALUATION
    Evaluation for Management and Monitoring
    Evaluation for Program Description
    Evaluation for Program Improvement
    Evaluation for Accountability
    Evaluation for Creating New Knowledge

  MAJOR OPTIONS
    The Formative/Summative Dimension
    The Process/Outcome/Impact Dimension
    The Quantitative/Qualitative Dimension

  BENEFITS AND RISKS
    Benefits
    Risks

SECTION II: INTRODUCTION TO RESEARCH METHODS

  SAMPLING
    Samples with Known Sampling Errors
      Simple Random Sampling
      Systematic Sampling
      Multistage Random Sampling
      Stratified Sampling
      Cluster Sampling
      Stratified Cluster Sampling
      Sequential Sampling
    Samples Without Known Sampling Errors
      Convenience Sampling
      Quota Sampling
      Modal Sampling
      Purposive Sampling
      Snowball Sampling

  ISSUES RELATED TO SAMPLE SIZE
    Sample Size and Generalizability
    Sample Size and Statistical Power

  THE VALIDITY OF EVALUATIONS AND POTENTIAL SOURCES OF BIAS
    Issues Related to External Validity
      Sampling Biases
      Measurement Biases
    Issues Related to Internal Validity
      History
      Maturation
      Testing
      Instrumentation
      Statistical Regression
      Selection
      Mortality
      Interactions with Selection
      Ambiguity About the Direction of Causal Influence

  COMPARISON AND CONTROL GROUPS
    The True No-Treatment Control
    The Waiting List Control
    The "Traditional Treatment" Control
    The Minimal Treatment Control
    The Placebo Control
    The Non-Eligible Comparison Group

  MEASURING OUTCOMES
    Self-Report Measures
    Other-Reports
    Behavioral Observation
    Records Data

  QUALITATIVE EVALUATION METHODS
    Data Collection Techniques
      Observation
      Participant Observation
      Qualitative Interviews

  QUALITATIVE ANALYSIS

CONCLUSION

NOTES




ABOUT THIS GUIDE BOOK 



There are many text books on program evaluation 
and this Guide Book is not intended to be one of 
them. Instead it is intended to provide an overview 
of program evaluation as it applies to the evalu- 
ation of faculty development and clinical training 
programs in substance abuse for health and mental 
health professional schools. The Guide Book ref- 
erences other resources that can be consulted for 
more detailed and technical information. 

The Guide Book is divided into two major sections. 
Section I discusses conceptual issues in program 
evaluation, including the uses of evaluation, the 
major options, and the benefits and risks of 
conducting evaluation studies. Section II is intended 
as an introduction to research methods. It includes 
discussions of sampling, statistical power, the 
validity of evaluations and potential sources of 
bias, comparison and control groups, and 
measurement of outcomes. A section on data analysis is not 
included because this aspect of evaluation has 
become so complex in recent years that no simple 
treatment is possible. However, all universities 
have departments of statistics that may be called 
upon for assistance in data analysis. 

It is important to note that it is not expected that an 
individual unfamiliar with program evaluation will 
be able to plan an evaluation based on the material 
presented in this Guide Book. Rather, the purpose 
of the Guide is to enable program developers and 
other faculty to work as partners with evaluators in 
the development of evaluation designs that meet 
the specialized needs of faculty development and 
clinical training programs. 






SECTION I: 
ISSUES IN PROGRAM EVALUATION 



USES OF PROGRAM EVALUATION 

In many ways, "program evaluation" is a misleading 
phrase. "Evaluation" implies judgement of 
what is good and what is not good. Although 
determination of whether or not a program is 
"good" is the goal of some types of program 
evaluation, it is by no means the only -- or even the 
major -- goal of program evaluation in general. 
Unfortunately, the "valuative" dimension of program 
evaluation is what most often comes to mind 
for many people. Thus, individuals unfamiliar 
with the wide range of activities subsumed under 
program evaluation may approach this topic with 
some trepidation -- especially if it is their programs 
that are to be evaluated. 

A much more general goal of program evaluation 
-- and one which underlies almost all evaluation 
activity -- is the production of systematic information 
to be used in making decisions about programs. 
The complexity and scope of these decisions 
can vary widely, as can the complexity and 
scope of the evaluation, but the principle is always 
the same: 

Evaluations are implemented to 
improve decision-making. 



All the diverse, specific purposes of a given 
evaluation derive from this principle. With this general 
principle in mind, we now examine some of the 
specific purposes of program evaluation. In each 
case, the discussion will be tied to the kinds of 
decisions that underlie each specific purpose. 

Before discussing the specific purposes of program 
evaluation, it is important to define clearly 
what is meant by a "program." In the area of 
faculty development and clinical training in 
substance abuse, a program may be as broad as an 
entire institution's attempt to improve its 
instructional program or as narrow as the techniques used 
by an individual practitioner to confront 
substance-abusing patients or counsel adolescents. A 
program may be the introduction of new content into 
an existing course, or the implementation of an 
innovative teaching strategy. All of these 
"programs" may be evaluated on many different levels 
and for a variety of purposes. Of course, some 
kinds of evaluations will be more appropriate for 
some programs than for others. In general, 
however, all of the types of evaluation described below 
are applicable, to some degree, to any aspect of 
faculty development or clinical training programs. 






Evaluation for Management and Monitoring 

One common use of program evaluation is the 
monitoring and management of programs. Especially 
in large programs such as the implementation 
of a new substance abuse curriculum or the 
opening of a new substance abuse unit in a teaching 
hospital, so much is going on that it is often 
difficult to keep track of the program on a 
day-to-day basis. 

Management information or monitoring systems 
are designed to meet ongoing information needs 
(i.e., those information needs that can be expected 
to exist as long as the program is in operation). In 
general, the questions addressed by such systems 
relate to events that are recurrent and routine. 
Examples of questions related to an institution's 
instructional program might include: 

How many hours of didactic and clinical instruction 
on substance abuse are offered each quarter 
or semester? 

How many students are exposed each quarter or 
semester to a given course or clinical experience? 

What are the trends in student exam scores in the 
area of substance abuse? 

Similarly, questions concerning the provision of 
clinical services might include: 

How many patients are diagnosed each month with 
substance-related problems? On what services 
are they diagnosed? Where are they referred? 

How many beds are filled at any given time on the 
substance abuse unit? From where were patients 
referred? What are the intake diagnoses? 

Regular review of the data generated by a 
management information or monitoring system provides 
an up-to-date picture of the program as a whole or 
of specific program elements (e.g., a specific course 
or practicum) which can be used both to make 
short-term corrections and to plan future program 
directions. 
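
As an illustration only, the first two instructional questions above can be answered with a very small tally. The sketch below is written in Python and assumes a hypothetical list of course-offering records; the terms, course names, and figures are invented for the example and do not come from any actual program. Note that the student total counts enrollments, so a learner who takes two courses in a term is counted twice.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration:
# (term, course, hours of instruction, students enrolled)
records = [
    ("1990-Q1", "Substance Abuse Lecture Series", 12, 40),
    ("1990-Q1", "Clinical Practicum", 30, 8),
    ("1990-Q2", "Substance Abuse Lecture Series", 12, 45),
    ("1990-Q2", "Clinical Practicum", 30, 10),
    ("1990-Q2", "Adolescent Counseling Elective", 6, 15),
]


def summarize_by_term(rows):
    """Return {term: {"hours": total hours, "students": total enrollments}}."""
    summary = defaultdict(lambda: {"hours": 0, "students": 0})
    for term, _course, hours, students in rows:
        summary[term]["hours"] += hours
        # Enrollments, not unique learners: one person in two courses counts twice.
        summary[term]["students"] += students
    return dict(summary)


summary = summarize_by_term(records)
for term in sorted(summary):
    print(term, summary[term])
```

Reviewing such a summary each quarter or semester is one simple way to keep the routine, recurrent questions in view; a real monitoring system would draw on the institution's own records rather than a hand-entered list.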



Evaluation for Program Description 

Often, there is a need to capture the overall 
operation of a program or program element at a given 
point in time -- that is, to develop a comprehensive 
program description. Such a need might arise, for 
example, in developing a manual to allow the 
replication of a substance abuse curriculum or 
curriculum element. Systematic program descriptions 
are a major vehicle for communicating with 
the field, and, as discussed later, a comprehensive 
program description is a prerequisite for any 
evaluation activities designed to assess program 
effectiveness. In all cases, the need is to answer the 
question: 

What does this program look like as it actually 
operates? 

Many of the evaluation activities undertaken for 
the purpose of program description are similar to 
those undertaken for the purposes of management 
and monitoring. The major differences between 
evaluation for the purpose of management and 
monitoring and evaluation for the purpose of 
description concern the time-frame of the data 
collection and the depth of the data collection. 

As suggested, management or monitoring systems 
are ongoing and provide a dynamic picture of 
changes over time. By contrast, a descriptive 
evaluation usually attempts to capture a "still pic- 
ture" of a program or program element as it oper- 
ates at a given point in time. Of course, a program 
description may contain historical or developmen- 
tal elements, but the emphasis is on the program as 
it operates now. 

Descriptive evaluations strive for a greater depth 
than can be captured in the numbers generated by 
a management information system. For example, 
a management information system might reveal 
that in a given semester, X number of learners 
attended a newly offered selective course on 
anticipatory guidance for adolescent and pre-adolescent 
patients and their parents. In addition, a 
descriptive evaluation would seek information 
concerning the learners' experiences in the course, 
how they reacted, and whether or not learners felt 
they could or would apply the course content in 
clinical practice. Gathering such information might 
require observation of course sessions, interviews 
with learners, debriefing of instructors, and so on. 
Such in-depth data collection strategies are the 
hallmark of descriptive evaluations. It is the 
fine-grained detail they provide that often makes them 
so useful in management decision-making and in 
enhancing assessments of program effectiveness. 

Evaluation for Program Improvement 

All professional schools are eager to improve the 
instructional program offered to learners. Program 
evaluation provides a mechanism for developing 
the information needed to make such improvements 
in a rational and planful manner. 

Program improvement can take many forms. In 
some cases, it might mean attracting more learners 
to important courses or finding more appropriate 
clinical settings for clinical placements. In other 
cases it might mean making changes in the policies 
of the department or institution. In still other cases, 
program improvement might mean more effective 
or efficient learning, or increased impact on 
clinical practice behavior. 

The questions that might be raised relevant to 
program improvement are numerous, but they all 
take the general form: 

How can this program or program element be 
made more effective? 

Sometimes the answers to these questions derive 
from management or monitoring systems or 
descriptive evaluations. Sometimes they require 
complex experimental designs. In either case, the 
emphasis is on identifying strengths upon which to 
capitalize and weaknesses in need of correction. In 
this sense, program evaluations for program 
improvement tend to fit with the traditional 
"valuative" connotations of program evaluation 
discussed earlier. However, the emphasis is on the 
use of data internally and in the overall service of 
the program. 

Evaluation for Accountability 

Evaluation for the purpose of accountability -- that 
is, evaluation for the purpose of determining whether 
or not a program is meeting its stated objectives -- 
also informs decision making. Often, however, 
the decision maker is someone outside the program 
(e.g., funders or university administration), and 
sometimes the decision is one of program 
continuation or termination. Thus, accountability 
evaluations can be threatening to those involved. 

Accountability evaluations are similar to the evalu- 
ation activities described thus far, and may address 
many of the same questions. For example, an 
accountability evaluation of a clinical training 
program might address issues of course availabil- 
ity, course content, and course attendance (man- 
agement or monitoring evaluation), learner expe- 
riences with and reactions to the courses (descrip- 
tive evaluation), and/or strengths and weaknesses 
(program improvement evaluation). Thus, a pro- 
gram that has these evaluative elements in place is 
generally in a good position to respond to requests 
for at least some kinds of accountability information. 

Simply demonstrating an evaluation-based approach 
to program development, implementation, and 
improvement is one indication that a program is 
well run and rationally managed. Thus, programs 
in which evaluation is an ongoing and integral 
component of program operations tend to fare well 
when required to be accountable to outside deci- 
sion makers. 

Evaluation for Creating New Knowledge 

A final purpose of evaluation is to develop and/or 
test hypotheses about programs and their effects. 
In the broadest sense, the goal of such evaluations 
is to contribute to the general knowledge base 
about what does or does not lead to improvements 
in professional education, clinical practice 
behavior, and/or in patient outcomes regarding 
substance abuse. 

If they are to be accepted by the scientific 
community, evaluations that seek to contribute to the 
general knowledge base must generally meet 
rigorous standards of research design. Many of these 
design issues are discussed in Section II of this 
Guide Book. It is important to note, however, that 
even the most rigorous evaluation design is greatly 
enhanced if other evaluative elements are in place 
(good program monitoring, a capacity for detailed 
program description, etc.). 

It is also important to note that evaluations 
designed to produce new knowledge can have direct, 
practical implications for the institutions that 
conduct them. For example, an institution considering 
the adoption of interactive laser disc technology as 
a pedagogical strategy, or the use of standardized 
patients (actors trained to present a standardized 
complaint) as an assessment strategy may wish to 
compare these strategies to traditional, but less 
expensive strategies to determine if the 
improvement in outcomes is worth the extra cost. 

MAJOR OPTIONS 

Once the purpose of a given evaluation initiative 
has been determined, the evaluator must choose 
among available evaluation options. Many of 
these options are methodological and are discussed 
in detail in Section II. However, there are 
conceptual options that should be addressed before 
methodological options are considered. These 
conceptual options determine the overall direction and 
scope of the evaluation. They also guide, to some 
degree, the methodological decisions to follow. It 
is these conceptual options that are discussed here. 

Program evaluation activities may be seen as 
varying along three major conceptual dimensions. 

The overall goal of the evaluation (Formative, 
Summative) 

The level of information gathered (Process, 
Outcome, Impact) 

The type of information gathered (Quantitative, 
Qualitative). 

Before discussing these dimensions in more detail, 
it is important to emphasize that they do not 
represent "either/or" options. For example, while 
an evaluation may be largely formative, it may 
have summative components. Similarly, a largely 
quantitative data collection effort may include 
qualitative data collection as well. Thus, 
evaluation activities are most often combinations of 
several components from each dimension, 
depending upon the overall purpose of the evaluation 
initiative. 

The Formative/Summative Dimension 

The terms "summative" and "formative" refer to 
the uses to which evaluation information is put. 
Formative evaluation information is used for 
program improvement and is usually reviewed on a 
regular basis as it is collected. A management or 
monitoring evaluation system in which data are 
reviewed each quarter or semester would be a 
relevant example. Summative evaluation 
information is most often used to assess the efficiency 
or effectiveness of a program that has reached a 
stable operational state. In contrast to formative 
information, summative evaluation information is 
often not analyzed until all data collection is 
completed in order to avoid contamination of the 
program under consideration. A controlled 
experimental study comparing an innovative 
curriculum module to a standard module is an 
example of a summative evaluation. 

There is a tendency to associate formative 
evaluation with program descriptive information and 
summative evaluation with assessment of program 
outcomes. In fact, any type of evaluation may be 
either summative or formative, depending on the 
purpose of the evaluation and the uses to which the 
information is put. For example, a descriptive 
evaluation implemented in order to develop a 
dissemination manual for an innovative program 
component (e.g., the use of standardized patients 
in a training) would be a summative evaluation 
because the purpose of the evaluation is to provide 
a finalized description of the component, and because 
the program component is stable at the point at 
which it is described. Similarly, an evaluation 
which provides an assessment of the effectiveness 
of a promising new educational technique would 
be considered formative if the purpose of the study 
were to aid in the further development of the 
technique. 

In general, summative evaluation, whether 
oriented to description or outcome assessment, will 
require a higher level of scientific rigor than will 
formative evaluation, and will, therefore, be more 
difficult and costly to implement. Keep in mind 
that it is inappropriate to carry out a summative 
evaluation when a program is still in a state of 
evolution, and when program components have 
not yet reached the level of stability required for a 
summative assessment. 

The Process/Outcome/Impact Dimension 

Some evaluators have found it useful to distinguish 
three levels of evaluation: process, outcome, and 
impact. 

The process level is concerned with the quality and 
quantity of program inputs and activities, and with 
the socio-political context in which the program 
operates. Examples of the kinds of information 
subsumed by the process level include faculty and 
learner demographics, course descriptions, learner 
evaluations of courses, and context variables such 
as institutional support for curriculum changes. 

Process data are the basis for program 
management and monitoring systems and for evaluations 
designed for the purpose of program description. 
Process data also play several key roles in 
evaluations designed to assess program effectiveness. 
At the simplest level, process data allow a 
determination of the actual program that is being 
evaluated. Programs as implemented are almost always 
different from programs as planned, and it is the 
effectiveness of the program as implemented that 
is of primary interest. 

Process data also aid in unraveling the usually 
complex outcomes of programs and program 
components. For example, answers to the question 
"What pedagogical approaches work best with 
different categories of learners (e.g., undergradu- 
ates, post-graduates, fellows, faculty, etc.)?" are 
derived from an examination of differential effec- 
tiveness data in light of process information. 

The preceding discussion suggests a general rule: 
Assessments of program effectiveness should not 
be undertaken in the absence of a well developed 
process evaluation. 

The outcome level subsumes assessments of the 
accomplishment of specific program objectives 
concerned with changes in knowledge, attitudes, 
clinical behavior, patient outcomes, and so on. 
This simple conceptual definition belies the 
technical complexities involved in implementing 
adequate evaluations at the outcome level. 

The technical complexities associated with 
outcome assessments derive from the necessity of 
meeting two challenges. First, measurement 
techniques must be located or developed that provide 
adequate representation of the outcome objectives 
of the program. Because substance abuse is a 
relatively new field in professional education, few 
standardized measures are available. Second, an 
evaluation design must be developed that supports 
the inference that it is the program or program 
component under study that has caused changes in 
attitudes, knowledge, clinical behavior, or patient 
outcomes -- that is, a design that rules out 
alternative explanations of the observed changes. A large 
proportion of the methodological discussions in 
Section II are concerned with strategies for 
meeting these two challenges of outcome assessments. 

Evaluations at the impact level assess the broader 
effects of a program or the aggregate effects of 
several programs operating over time. Examining 
the extent to which substance abuse training is 
institutionalized over a period of years would be 
one example of an impact evaluation. Impact 
evaluations also look for unintended program 
effects, both positive and negative. For example, a 
well regarded substance abuse training program in 
one professional school might stimulate the 
development of similar programs in other professional 
schools in the university. Generally, such 
unintended effects are discovered through descriptive 
data collection, and only the most complex and 
costly impact evaluations seek to study such 
effects systematically. 

The Quantitative/Qualitative Dimension 

The distinction between quantitative and qualita- 
tive evaluation is often made in terms of the 
methods used to collect data. Thus, knowledge 
tests, closed-ended interviews, and counts of pa- 
tients referred for treatment are considered quanti- 
tative methods, while observation, open-ended 
interviews, and case histories are considered 
qualitative methods. The former are characterized as 
systematic and quantifiable while the latter are 
considered less systematic and less quantifiable. 
This distinction rapidly breaks down, however, as 
it becomes clear that almost any data collection can 
be made systematic and all data can be quantified. 
For example, observational methods may use stan- 
dardized protocols and sophisticated time samples 
that yield precise data; open-ended interviews may 
be coded to produce reliable categories, and so on. 
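
To make "reliable categories" concrete, the sketch below checks how closely two coders agree when they assign the same open-ended responses to categories. It uses Cohen's kappa, a common chance-corrected agreement statistic that this Guide Book does not itself name; the choice of statistic, the category labels, and the codes are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders over the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of items")
    n = len(codes_a)
    # Proportion of items on which the two coders chose the same category.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to eight interview responses.
coder_a = ["positive", "negative", "positive", "neutral",
           "negative", "positive", "neutral", "positive"]
coder_b = ["positive", "negative", "neutral", "neutral",
           "negative", "positive", "negative", "positive"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

In this invented example the coders agree on six of eight responses. Values near 1 indicate agreement well beyond chance, while values near 0 suggest that the coding scheme or the coder training needs revision before the categories are counted.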

A more useful distinction is to consider the general 
paradigm that underlies quantitative and 
qualitative analyses. Quantitative analyses derive from a 
hypothetico-deductive paradigm in which a 
hypothesis or series of hypotheses are tested by 
relating variations in independent variables to 
variation in dependent variables. Such analyses 
require a substantial body of knowledge about the 
phenomenon (program) under study and the 
mechanisms that are thought to underlie its operation and/or 
effects. Here, precise definitions of variables 
and precision in measurement are key to the 
hypothesis testing process. 



Qualitative analyses derive from a 
holistic-inductive paradigm which works backwards from the 
phenomenon (program) under study in an attempt 
to develop hypotheses about its operation and 
effects. Although data collection may still be 
highly systematic, the evaluator casts a much wider 
net than in quantitative analyses, attempting to 
establish patterns and regularities on which to 
develop insights into the mechanisms underlying 
operations and effects. 

The choice of options along the 
qualitative/quantitative dimension will depend upon the level of 
existing knowledge concerning the program or 
program component under consideration and the 
nature of the questions being asked. Where there 
is sufficient knowledge to develop specific 
hypotheses and to define relevant variables precisely, 
a quantitative approach is indicated. By contrast, 
when relatively little is known about a program or 
component (as might be the case in a newly 
initiated, novel pedagogical practice), a more 
qualitative approach will generally provide a richer yield 
of useful information. 

Frequently, evaluations evolve from qualitative to 
quantitative. For example, a given department 
might be interested in exploring the mentoring 
component of a faculty development program. 
Very little may be known at first about how the 
component operates, its relevant features, even its 
feasibility. The evaluation might start, therefore, 
by exploring the component qualitatively through 
semi-structured interviews with participating 
faculty, their mentors, and other key informants in the 
department (e.g., administrative personnel). Once 
the specific variables of interest have been 
identified qualitatively, they might be measured 
quantitatively through a closed-ended survey with 
quantitative rating scales. 

Most successful evaluation efforts comprise a blend 
of both quantitative and qualitative data. For 
example, a study of learner attitudes towards 
substance abusers might rely on a paper-and-pencil 
assessment amenable to quantitative analysis 
supplemented by in-depth, open-ended debriefing of a 
smaller sample of learners. Here, the more precise, 
but necessarily narrow, data from the quantitative 
analysis are supplemented by the descriptive depth 
offered by the qualitative analysis. 

Quantitative data supplemented by qualitative 
anecdotes are also a powerful tool for 
communicating evaluation results. Evaluation practitioners 
have long known that, although quantitative rigor 
is the sine qua non of program evaluation, most 
individuals are more compelled by anecdotes and 
case examples than they are by statistical graphs 
and tables. This is not to suggest that evaluations 
should therefore be based solely on anecdotes and 
case examples. Rather, the point is that such 
qualitative data are useful communication tools 
when they are consistent with and support 
quantitative findings. 

BENEFITS AND RISKS 

Benefits 

Many of the benefits of evaluation have already 
been described or alluded to in the preceding 
discussions. As suggested, the primary benefit of 
evaluation is that it provides a rational basis for 
program decision making -- including decisions 
concerning the most effective ways for integrating 
substance abuse content into professional 
education -- by providing systematic input into the 
decision making process. To this may be added at 
least four other benefits: 

Evaluation as an early warning system 

Evaluation as positive reinforcement 

Evaluation as burnout prevention 

Evaluation as legitimization 

As already suggested, evaluation activities can 
provide regular feedback on the operation of a 
program or program component. One of the most 
important functions of such activities is their 
ability to serve as an early warning system, uncovering 
problems as they arise. A regular review of 
management or monitoring system data allows problem 
areas which might otherwise go unnoticed to be 
identified, and corrective action to be taken 
before these problems become serious. 

Everyone is interested in feedback about job per- 
formance, and, in professional education, faculty 
are always interested in the impact their work is 
having on learners. Similarly, clinicians are
extremely concerned about the impact they are
having on the health and well being of patients or
clients. An evaluation system can be of direct 
benefit in meeting these needs and can provide 
positive reinforcement to individuals who might 
otherwise have little systematic way of assessing 
their own performance. 

Related to the reinforcing value of evaluation is its
role in preventing "burnout." It has often been
observed that the introduction of evaluation
activities into a program provides a new and interesting
focus for individuals who may be somewhat jaded
or discouraged. It is also the case that the
self-examination that accompanies the introduction of
evaluation activities into a program provides a
positive forum for discussion of operational issues
that might otherwise remain hidden. Repeated
observation of this self-examination phenomenon
has led some evaluators to refer to evaluation
planning activities as the "evaluation therapy"
process.

Educating the public and other important
constituents about programs -- i.e., public relations --
is another important function of evaluation data. As
noted, descriptive evaluations are often used as
communication tools to answer the common
question, "What exactly does this program do?"
Similarly, management or monitoring systems allow
ready answers to the four journalistic "W's" (who,
what, where, and when), issues often raised by
outside individuals interested in program
operations.

Finally, the legitimizing function of evaluation 
must be considered. All else being equal, a
program that is actively engaged in examining its
activities and its impact will be viewed as more
legitimate than a program that is not. The
legitimacy that derives from an active program of
evaluation is well warranted -- programs that include
evaluation as part of their ongoing activities are
also more likely to be those programs that are
effectively managed and whose growth and
development are rational and planful.

Risks 

The most commonly discussed risk of evaluation is 
that the data will reveal something negative about 
the program or its impact. This perspective derives 
from the view of evaluation as largely a "valu- 
ative" activity. As has been repeatedly empha- 
sized, the "valuative" function of evaluation is by 
no means its only, or even its primary function. 
Even when evaluations are designed to directly
assess program worth or efficacy, evaluation
results are almost always a mixture of positives and
negatives. Rarely, if ever, do evaluation results
deliver the programmatic coup de grâce.

This is not to imply that evaluation activities are
without risks. However, these risks tend to have
little to do with what the results of the evaluation
reveal about the program. Rather, they involve the 
practicalities of introducing evaluation activities 
into the ongoing operations of a program. Specifi- 
cally: 

Evaluations can be threatening to those who have 
an investment in program success. 

Evaluation activities can interfere with other ac- 
tivities. 

Evaluations can take away resources from pro- 
gram activities. 

Despite all arguments to the contrary (including 
those presented in this manual), some people in- 
volved with a program will be initially threatened 
by the prospect of initiating evaluation activities. 
Even when it is understood that the evaluation will
not be used to make judgments about the adequacy
of performance (i.e., it is the program, not the
people, that is being evaluated), evaluation can still
be threatening because evaluation may portend
change and a threat to the status quo. In addition,
many people harbor the suspicion (sometimes 
justified by previous experience with evaluators), 
that the tools of evaluation are ill-suited to measure 
the quality of certain program activities or to assess 
"intangible" outcomes. 

Experience suggests that the threat of evaluation 
can be reduced by including the individuals whose 
program is to be evaluated in the evaluation plan- 
ning process and by providing them with an active 
role in the interpretation of evaluation results. 
Moreover, certain evaluation components (e.g., 
management or monitoring systems) can be of 
benefit to faculty and other staff by providing 
regular feedback as discussed earlier. 

Experience also suggests that autocratic attempts
to impose evaluations on resistant participants will
usually fail. The ways in which an unwanted
evaluation can be undermined are too numerous to
list -- it suffices to say that without full
cooperation, most evaluation activities will produce little
in the way of useful information.

Almost all evaluation activities obtrude, in some 
way, into ongoing program or institutional activi- 
ties. Such interference can be minimized by care- 
ful selection of evaluation methods, but the fact 
remains that additional activities will be required 
that are not always consistent with programmatic 
objectives. For example, an evaluation design may 
include the observation of interactions with pa- 
tients or clients, and it may be argued that the 
presence of the observer negatively affects the 
provider/patient relationship. Similarly, time 
devoted to an extensive knowledge or attitudes 
pre-test is time taken away from instruction. There 
is no "magic bullet" to resolve these conflicts 
between the evaluation agenda and the program 
agenda. Rather, a balance must be struck between 
compromise in the program and compromise in the 
evaluation design given the probable benefits of
implementing a given evaluation activity.

One final risk of evaluation bears consideration --
the risk of misusing evaluation results. In some
instances, slipshod or equivocal evaluations are
used to support claims of program effectiveness
that are not justified by the evaluation design or by 
the data. Here, the loss of credibility can be 
sufficiently damaging to the program or institution 
as to overshadow any benefits that might derive 
from the "positive" evaluation results. 

Similarly, design or implementation weaknesses 
in the program itself or in the evaluation may lead 
to negative evaluation findings that are not a true 
reflection of the effectiveness or potential effec- 
tiveness of the program. An otherwise sound 
program idea may be prematurely rejected. 

Both promotion of an unsound program and the
rejection of an effective one are serious risks -- they
can best be avoided by using the most rigorous
methodology feasible and by interpreting
evaluation results carefully.



SECTION II:
INTRODUCTION TO RESEARCH METHODS 



This section provides an overview of the methodo- 
logical issues involved in evaluating faculty devel- 
opment and clinical training programs in sub- 
stance abuse. The overview is intended to raise key 
issues in program evaluation research methods, 
and to relate these issues to the specific problems 
faced in the design and implementation of the 
various kinds of evaluation activities described 
earlier (monitoring and management systems, 
descriptive studies, accountability studies, etc.). 
Where appropriate, references are provided to more 
detailed, technical descriptions for exploration by 
the interested reader. 

SAMPLING 

Sampling involves strategies for selecting a por- 
tion of cases from or about which data will be 
collected with the goal of making generalizations 
about the larger population from which the cases 
are drawn. For example, in evaluations focused on 
learner outcomes, a case might be a single learner, 
the enrollees in a course, or all the learners in a 
department. Similarly, the population to which
generalizations are to be made might be all learners
enrolled in a given course, all learners in a given
clinical specialty, all departments within a
professional school, and so on.



In some instances, an evaluation design may call 
for data collection from or about all cases in a 
population. This might be the case, for example, in 
a management or monitoring system that tracks 
learner outcomes from year to year based on stan- 
dardized test scores. More often, however, it is not 
practical, or even desirable, to collect data from an 
entire population. Under these circumstances, a 
sampling strategy is required. 

Sampling strategies may be divided into two broad 
categories -- those that yield samples with known 
or measurable sampling errors, and those that do 
not. Sampling error is a measure of the precision 
with which the parameters of a sample represent 
the parameters in the larger population. From the 
standpoint of methodological rigor, samples with 
known sampling errors are clearly preferred. 
However, such samples may not always be practi- 
cally or conceptually feasible, and other sampling 
strategies must, out of necessity, be employed. 

All samples, with or without a known or measur- 
able sampling error, may be further categorized 
according to their variability. In general, samples 
with lower variability are preferred, owing to the 
inverse relationship between variability and statis- 
tical power (i.e., the ability to detect effects if they 
are present). It is important to note that the
preference for samples with lower variability applies
to quantitative studies. The situation
in qualitative studies may be quite different -- in
qualitative studies, samples may be constructed in
such a way as to increase variability in order to
gather as diverse a data base as possible.

The following sections describe the major
sampling strategies used in program evaluation.
Strategies yielding samples with and without known
sampling errors are discussed separately.

Samples with Known Sampling Errors

Simple Random Sampling: In simple random
sampling, each population member (case) is
assigned a unique number. The sample is then
selected via use of random numbers. Simple
random sampling has three advantages: 1) it
requires only minimum knowledge of the
population a priori, 2) it avoids classification errors, and
3) it facilitates analysis of data and computation of
errors. Disadvantages of simple random sampling
include that it does not make use of knowledge of the
population which might be possessed by the
evaluator, and it yields larger errors (for the same sample
size) than does stratified sampling.
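
The procedure just described can be sketched in a few
lines of Python. This is an illustration only: the roster
of 200 numbered learners and the function name are
hypothetical, and the seed is used solely so that the
draw can be repeated.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases, each with an equal chance of selection."""
    rng = random.Random(seed)  # seed only makes the draw repeatable
    # random.sample selects without replacement, so no case appears twice.
    return rng.sample(list(population), n)

# Hypothetical roster: each learner is assigned a unique number.
learners = [f"learner-{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(learners, 20, seed=1)
```

Note that nothing about the learners (department,
specialty) is used in the draw, which is both the
strength and the weakness noted above.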

Systematic Sampling: Systematic sampling
exploits the natural ordering of a population. A
random starting point is selected between the number
one and the nearest integer to the "sampling ratio"
-- defined as the ratio (N/n) between the size of the
population (N) and the size of the sample (n).
Items are then selected at the interval nearest (at the
whole number) to the sampling ratio. If the
population is ordered with respect to some pertinent
property (e.g., department, clinical specialization),
then systematic sampling yields a stratification
effect (e.g., ensures roughly equal representation
of departments or specialties). This stratification
reduces variability compared to a simple random
sample -- the major advantage of systematic
sampling. In addition, systematic sampling facilitates
both the drawing and checking of the sample.

However, if the sampling interval is related to a
periodic ordering of the population (e.g.,
substance abuse-related emergency room admissions
by time of day or day of the week), increased
variability may be introduced. This is the major
disadvantage of systematic sampling. When there
is this type of stratification effect, estimates of
error are likely to be high.
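
Both the stratification effect and the periodicity
problem can be seen in a small sketch. The roster and
admission log below are invented for illustration, and
the interval is taken as the whole number nearest to
N/n (population size over sample size).

```python
import random

def systematic_sample(ordered_cases, n, seed=None):
    """Take every k-th case after a random start, with k close to N / n."""
    cases = list(ordered_cases)
    interval = max(1, round(len(cases) / n))          # whole-number interval
    start = random.Random(seed).randrange(interval)   # random start, 0-based
    return cases[start::interval]

# Hypothetical roster ordered by department: 4 departments x 30 learners.
roster = [(dept, i) for dept in "ABCD" for i in range(30)]
picked = systematic_sample(roster, 12, seed=5)

# Hypothetical admission log covering 12 weeks; day % 7 is the weekday.
days = list(range(84))
weekly = systematic_sample(days, 12, seed=5)
```

With the ordered roster, every department contributes
the same number of cases. With the admission log, the
interval of seven coincides with the weekly cycle, so
every selected day falls on the same weekday.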

Multistage Random Sampling: Multistage random
sampling proceeds in two or more stages, each of
which is a form of random sampling. For example, in a
follow-up study of patient outcomes, clinical sites
might be randomly selected (stage one), and then
patients or clients randomly selected within sites
(stage two). A major advantage is that sampling
lists, identification, and numbering are required
only for units belonging to subgroups (e.g.,
clinical sites) actually selected. If sampling units are
geographically dispersed (e.g., satellite clinics),
multistage random sampling cuts down on the time
and trouble needed to collect data.

On the negative side, errors are likely to be larger 
for multistage random sampling than for simple 
random or systematic sampling for the same sample 
size. Errors increase as the number of sampling
units selected decreases - if, for example, only a 
small number of satellite clinics can feasibly be 
sampled. 
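
A two-stage version of the patient follow-up example
might look like the following sketch. The clinic names,
patient identifiers, and function name are hypothetical.

```python
import random

def multistage_sample(sites, n_sites, n_per_site, seed=None):
    """Stage one: random sites. Stage two: random patients within each."""
    rng = random.Random(seed)
    chosen_sites = rng.sample(sorted(sites), n_sites)
    sample = {}
    for site in chosen_sites:
        patients = sites[site]  # a patient list is needed only for chosen sites
        take = min(n_per_site, len(patients))
        sample[site] = rng.sample(patients, take)
    return sample

# Hypothetical satellite clinics, each with 30 patients.
clinics = {f"clinic-{c}": [f"{c}-{p:02d}" for p in range(30)] for c in "ABCDEF"}
drawn = multistage_sample(clinics, n_sites=2, n_per_site=5, seed=7)
```

Because only two of the six clinics are visited, field
costs fall, but the estimate rests on fewer sampling
units, which is the source of the larger errors noted
above.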

In a major variant of multistage random sampling, 
sampling units are selected with probability pro- 
portionate to their size. This procedure has the 
advantage of reducing variability. Its major disad- 
vantage is that lack of a priori knowledge of the 
size of each sampling unit increases the variability. 

Stratified Sampling: Stratified sampling strategies 
make use of knowledge about characteristics of the 
population (e.g., age, sex, and ability to pay of 
patients in a clinic) which may be related to de- 
pendent variables (e.g., patient referrals). There 
are three major variants: 1) proportionate
sampling, 2) optimum allocation sampling, and 3)
disproportionate sampling. Each of these three 
variations is discussed in turn. 

In proportionate stratified sampling, selection from
every sampling stratum is random with probability
proportionate to size. This assures representation
with respect to the property or properties which
define the strata and, therefore, yields less
variability than simple random sampling or multistage
random sampling. Proportionate stratified
sampling also decreases the chance of failure to include
members of the population because the
classification process helps ensure that a wider range on the
variables that define the strata will be included. In
addition, the stratified sample facilitates
comparisons between or among strata (e.g., adult patients
vs. adolescent patients). On the negative side,
proportionate stratified sampling requires accurate
information on the proportion of the population in
each stratum. Otherwise, error will be increased.
If stratified lists are not available, these may be
costly to prepare. There is also the possibility of
faulty classification and hence, increased
variability.
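
As a minimal sketch of proportionate allocation, the
code below divides a total sample across strata in
proportion to stratum size and then draws at random
within each stratum. The patient lists are invented,
and the rule for handling rounding (largest remainders
first) is one reasonable choice, not the only one.

```python
import random

def proportionate_allocation(stratum_sizes, n):
    """Split a total sample of n across strata in proportion to size."""
    total = sum(stratum_sizes.values())
    exact = {h: n * size / total for h, size in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}   # round down first
    leftover = n - sum(alloc.values())
    # Give the remaining cases to the strata with the largest remainders.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

def stratified_sample(strata, n, seed=None):
    """Random selection within every stratum, using proportionate sizes."""
    rng = random.Random(seed)
    alloc = proportionate_allocation({h: len(m) for h, m in strata.items()}, n)
    return {h: rng.sample(members, alloc[h]) for h, members in strata.items()}

# Hypothetical clinic population: 150 adult and 50 adolescent patients.
patients = {
    "adult": [f"A{i}" for i in range(150)],
    "adolescent": [f"T{i}" for i in range(50)],
}
picked = stratified_sample(patients, 40, seed=3)
```

Here the sample of 40 contains 30 adults and 10
adolescents, mirroring the 3-to-1 split in the
population.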

Optimum allocation sampling procedures are the 
same as those in proportionate sampling, except 
that the sample is proportionate to the variability 
within strata as well as to their size. This assures 
that there will be less variability for the same 
sample size than in proportionate stratified sam- 
pling. The major disadvantage is that optimum 
allocation requires knowledge of variability of 
pertinent characteristics within each stratum. For 
example, it may be useful to classify patients 
within clinics according to variability in the nature 
and extent of substance abuse problems. However,
accurate assessments of such variability may be 
difficult to obtain. 
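
One common way to make the sample proportionate to both
size and variability is the rule usually called Neyman
allocation, in which each stratum's share is
proportional to its size multiplied by its standard
deviation. The sketch below assumes that rule; the
stratum sizes and standard deviations are invented.

```python
def optimum_allocation(stratum_sizes, stratum_sds, n):
    """Neyman allocation: n_h proportional to N_h * S_h (unrounded)."""
    weights = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}

sizes = {"adult": 150, "adolescent": 50}
# Hypothetical standard deviations of a problem-severity score.
sds = {"adult": 2.0, "adolescent": 6.0}
alloc = optimum_allocation(sizes, sds, 40)
```

Although adolescents are only a quarter of this
population, their greater variability earns them half of
the sample (20 of 40 cases), compared with 10 under
proportionate allocation.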

Disproportionate stratified sampling proceeds as 
in the proportionate and optimum allocation vari- 
ants, except that the size of the sample is not 
proportionate to the size of the sampling units, but 
is determined rather by analytic considerations or 
convenience. For example, one might choose to 
oversample a cell within a stratified sampling 
design if the proportion of this cell within the 
general population is so low as to yield
unanalyzable results (for example, oversampling patients of
a particular sex, age, and usage pattern such as
cocaine-addicted pregnant teenagers).



Disproportionate stratified sampling is more
efficient than proportionate stratified sampling for
comparison of strata or where different errors are
optimum for different strata. The major
disadvantage is that disproportionate stratified sampling is
less efficient than proportionate sampling for
determining population characteristics. That is, it
yields higher variability for the same sample size.

Cluster Sampling: In cluster sampling, sampling
units are selected via some form of random
procedure. The ultimate units are groups (e.g., a clinic,
enrollees in a course). Cluster sampling has
several advantages: 1) if clusters are geographically
defined, it yields the lowest field costs, 2) it requires
listing only units in selected clusters, 3)
characteristics of clusters as well as those of the population
can be estimated, and 4) it can be used for
subsequent samples because clusters rather than units
are selected, and substitution of units may be
possible. The disadvantages are larger errors for
comparable sample sizes than with other
probability samples, and the requirement that each member
of the population be uniquely assigned to a cluster
-- inability to do so may result in duplication or
omission of units.
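
The sketch below selects whole clusters at random and
also checks the unique-assignment requirement. The
course rosters and function names are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def find_duplicates(clusters):
    """Return units that appear in more than one cluster."""
    seen, dupes = set(), set()
    for members in clusters.values():
        for unit in members:
            if unit in seen:
                dupes.add(unit)
            seen.add(unit)
    return dupes

# Hypothetical course rosters; student s2 is enrolled in two courses.
courses = {
    "course-1": ["s1", "s2"],
    "course-2": ["s3"],
    "course-3": ["s2", "s4"],
}
selected = cluster_sample(courses, 2, seed=0)
overlap = find_duplicates(courses)
```

A non-empty result from the duplicate check signals that
some units could be counted twice, and the cluster
definitions would need to be revised before sampling.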

Stratified Cluster Sampling: Stratified cluster
sampling is to cluster sampling what stratified
sampling is to simple random sampling. That is,
sampling strata are defined based on known
characteristics of the clusters, and clusters are selected
at random from every sampling stratum.
Stratification reduces the variability of ordinary cluster
sampling, but combines the disadvantages of
stratified sampling with those of cluster sampling. In
addition, because cluster properties may change,
the advantage of stratification may be reduced, and
the sample may not be usable for subsequent
research.

Sequential Sampling: Sequential sampling is a
procedure whereby two or more samples of any of
the types discussed above are taken, with results
from earlier samples used to design later ones (or
to determine if they are necessary). Sequential
sampling provides estimates of population
characteristics that facilitate efficient planning of
succeeding samples, and thereby reduces errors in the
final estimate. In the long run, sequential sampling
also reduces the number of observations required.
It has several disadvantages: 1) it complicates the
administration of field work, 2) it requires more
computation and analysis than does non-repetitive
sampling, and 3) it can be used only where a very
small sample can approximate representativeness
and where the number of observations can be
increased conveniently at any stage of the research.

Samples Without Known Sampling Errors 

Convenience Sampling: The simplest sampling 
technique is convenience sampling. Here, the 
evaluator collects data on those individuals who 
are most readily available. An example would be 
an assessment of changes in learners' patient
interviewing skills using learners assigned to a certain
clinic chosen on the basis of geographic proximity
(i.e., it is easy to get to) or timing (the clinic hours
fit the schedule of the data collectors). Although
the convenience sample is one of the weakest 
available sampling options, it is usually sufficient 
for pilot studies or for studies aimed at very general 
assessments. 

Quota Sampling: Quota sampling attempts to
overcome some of the disadvantages of
convenience sampling by introducing stratification based
on a priori knowledge of a population. For
example, in a study of case notes entered by learners
exposed to an educational innovation, clinic
records might be used to determine the proportion of
male and female patients in different age groups,
and "quotas" set for male and female patients of
different ages that represent the population
proportions. Quota samples reduce variability and
increase representativeness, although the selection
of cases within categories (e.g., male, female) may
still be highly biased.

Modal Sampling: Modal sampling involves the
selection of a sample that is judged to be
representative of a population based on knowledge of the
general characteristics of the population of
interest. For example, experience might suggest that
patients in a given clinic tend to fall into categories
characterized by a set of related characteristics
(e.g., young males who are heavy drinkers but not
alcoholics; older males who are either light
drinkers or alcoholics, etc.). The modal sample is
selected to include individuals from each of the
various categories which, taken together, are thought
to represent the range of patients in that
population. Modal sampling is particularly useful in
evaluations where a large amount of resources is
to be devoted to the study of a small number of
cases, as might apply when a study requires lengthy
interviews with patients.

In general, the quality of a modal sample will
depend upon the extent and sophistication of the
evaluator's understanding of the characteristics of
the population. Like quota sampling, modal
sampling has the advantage of increasing
representativeness and decreasing variability, and like quota
sampling, the possibility of biased selection within
modal categories is a major disadvantage.

Purposive Sampling: Purposive sampling is used
in qualitative evaluation to ensure that information
is obtained from each group or individual of
interest. For example, in order to describe the changes
that have occurred as the result of the introduction
of a new curriculum, key informants (such as the
department chair and some professors) might be
selected from each participating department.

Snowball Sampling: Snowball sampling is a
variation on purposive sampling. Key informants are
selected and each of these informants is asked to
suggest other people from whom data should be
collected. For example, investigators interested in
the opinions of students most interested in alcohol
and other drug abuse issues might ask the
professors of key courses to suggest students to be
interviewed. Those students might in turn be asked
to suggest other students who share their interest in
alcohol and other drug abuse topics. This process
could continue until a sample of sufficient size has
been identified or until most of the suggested
informants have already been named by others.
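
The referral process and its two stopping rules can be
sketched as follows. The names and the referral table
are invented; in practice the referrals would come from
interviews rather than a lookup.

```python
from collections import deque

def snowball_sample(seeds, get_referrals, max_size):
    """Follow referrals from the seed informants until the sample reaches
    max_size or no new names are suggested."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for name in get_referrals(person):
            if name not in seen:   # skip people already named by others
                seen.add(name)
                queue.append(name)
    return sample

# Hypothetical referrals among students.
referrals = {"Ana": ["Ben", "Cy"], "Ben": ["Cy", "Dee"], "Cy": ["Ana"], "Dee": []}
informants = snowball_sample(["Ana"], lambda p: referrals.get(p, []), max_size=10)
```

In this example the process stops on its own after four
informants, because every further suggestion names
someone already in the sample.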



ISSUES RELATED TO SAMPLE SIZE 

One of the most commonly asked questions in
planning an evaluation is: "How large a sample is
required?" Actually, there are two different
questions related to sample size. The first question is:
"How big a sample is required in order to make 
reasonable generalizations from the sample to the 
larger population?" The second question is: "How 
big a sample is needed in order to make meaningful 
inferences about effects?" 

Sample Size and Generalizability 

The first question relates to the issues of sampling 
error discussed earlier - that is, how well does the 
sample represent the population. In general, the 
larger the sample size, the smaller the sampling 
error, and hence, the better the representation of 
the population of interest. Specific formulae for 
calculating sampling error for different sample 
types and strategies can be found in Kish (1967).
As in most things, a law of diminishing returns
applies to sample sizes. As sample size increases, 
the relative increase in precision derived from each 
additional case decreases. Thus, evaluators some- 
times refer to the "net gain from sampling," a 
comparison between the increased cost in terms of 
data collection, data analysis, etc., of each addi- 
tional case and the concomitant benefit in terms of 
increased precision. 
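
The diminishing return can be made concrete with one
familiar case: the standard error of a proportion under
simple random sampling from a large population. That
formula is an assumption chosen for illustration; other
sample types call for the formulae referenced above.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling
    (large population, no finite population correction)."""
    return math.sqrt(p * (1 - p) / n)

def gain_from_doubling(p, n):
    """Reduction in standard error from increasing the sample from n to 2n."""
    return standard_error(p, n) - standard_error(p, 2 * n)

# Each doubling of the sample buys a smaller improvement in precision.
for n in (50, 100, 200, 400, 800):
    print(n, round(standard_error(0.5, n), 4), round(gain_from_doubling(0.5, n), 4))
```

Because the standard error shrinks with the square root
of the sample size, the sample must be quadrupled to
halve the error, while data collection costs typically
rise in direct proportion to the number of cases.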

When sampling strategies are employed that yield 
samples for which sampling error cannot be meas- 
ured, the sample size is largely a practical judge- 
ment based on the available resources for the 
evaluation, and a consensus among those who will 
use the evaluation results. Here, the issue becomes 
one of negotiating the number of cases needed to
instill confidence in those individuals who will be
using the information the evaluation produces.

Sample Size and Statistical Power 

The second question relates to statistical power,
that is, the ability of inferential statistics to detect
effects when such effects exist. The power of a
statistical test represents a complex relationship
among three factors:

The probability of mistakenly rejecting the hy- 
pothesis that there is no effect when, in fact, no 
effect exists (referred to as alpha) -- as alpha 
increases, power increases 

The magnitude of the true effect size in the popula- 
tion - as the magnitude of the effect increases, 
power increases 

The sample size - as the sample size increases, 
power increases. 

For a given level of power, knowledge of two of
these three parameters allows a calculation of the
third. Alpha is set by the evaluator. Thus,
knowledge of the true effect size
allows a calculation of the sample size needed to
obtain a given level of power. Conversely, for a
given sample size, the minimum detectable effect
can be estimated. Procedures for making these
calculations are found in Cohen (1969) and Cohen
and Cohen (1975).
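
As a rough illustration of both calculations, the sketch
below uses a textbook normal approximation for one
specific situation: a two-sided comparison of the means
of two equal-sized groups, with the effect expressed as
a standardized mean difference. These choices, the
default alpha of .05 and power of .80, and the function
names are assumptions made for the example; published
power tables based on the t distribution give slightly
larger sample sizes.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Cases needed in each of two groups to detect a standardized mean
    difference (normal approximation, two-sided test)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

def minimum_detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest standardized difference detectable with n cases per group."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * math.sqrt(2 / n)

needed = n_per_group(0.5)                  # sample size for a given effect
detectable = minimum_detectable_effect(40) # effect for a given sample size
```

The two functions are inverses of one another: the first
answers "how many cases?" for an assumed effect, and the
second answers "how small an effect?" for a fixed number
of cases.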

Unfortunately, the true effect size is generally not
known. Cohen and Cohen (1975) suggest three
strategies which allow a calculation of sample size
in the absence of a priori knowledge of the true
effect size. First, other similar studies may be
consulted in order to ascertain the probable effect
size in a given situation. For example, in order to
estimate the expected change in learner knowledge
scores as a result of participation in a given course,
one might examine changes observed in other
studies of similar courses.

Second, one might consider the size of effect that 
would make a practical difference - e.g., how big 
a change in clinical practice behavior would be 
required to make it worth implementing a given 
strategy to improve a given clinical skill? 

Finally, Cohen (1969) has suggested conventions
for "small," "medium," and "large" effect sizes
based on the effects generally observed in the
social sciences. In the absence of other
information, sample sizes may be calculated for all three
"conventional" effect sizes in order to establish a
range of sample sizes for a given evaluation study.
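
A range of this kind might be produced as in the sketch
below. It assumes a two-group comparison of means, for
which Cohen's conventional standardized differences are
0.2, 0.5, and 0.8, together with alpha of .05, power of
.80, and a normal approximation; other designs use other
effect size measures and conventions.

```python
import math
from statistics import NormalDist

# Cohen's conventional values for a standardized mean difference (d).
CONVENTIONS = {"small": 0.2, "medium": 0.5, "large": 0.8}

def n_per_group(d, alpha=0.05, power=0.80):
    """Cases per group for a two-sided, two-group comparison of means
    (normal approximation)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

sample_size_range = {label: n_per_group(d) for label, d in CONVENTIONS.items()}
print(sample_size_range)
```

Under these assumptions the range runs from about 25
cases per group for a large effect to about 393 per
group for a small one, which shows how strongly the
answer depends on the effect the evaluator expects.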

THE VALIDITY OF EVALUATIONS 
AND POTENTIAL SOURCES OF BIAS

The overall goal of the various social science 
methods used in evaluation is simply to ensure, to 
the greatest extent possible, that the information 
produced is accurate enough to yield valid conclu- 
sions useful in decision making. Stated another 
way, the goal of these methods is to reduce, to the 
greatest extent possible, the biases which may 
affect the validity of evaluation conclusions. 

Issues Related to External Validity 

All evaluations seek to maximize external validity --
that is, the extent to which the results of a given
evaluation are generalizable beyond the highly
specific conditions under which the results are
produced. External validity will be of concern
whether the evaluation is a management or
monitoring system, a descriptive evaluation, or a
complex assessment of program outcomes or impact.
In general, the biases that impair external validity
are to be found in the ways that samples are drawn
and measurement is conducted.

Sampling Biases: Several issues related to
sampling have already been discussed. As suggested,
various sampling techniques yield samples which 
estimate population parameters with varying de- 
grees of precision, and hence, will yield evaluation 
results with varying degrees of external validity. 
When samples are not systematic, and often they 
are not, biases will be introduced that decrease the 
meaning of results beyond the specific population 
from whom data are gathered. At the very least, 
these biases should be acknowledged in reporting 
evaluation results -- better still, attempts should be 
made to estimate and/or correct for their effects. 

Two of the most common sample biases that arise
in program evaluation -- and these obtain no matter
how the initial sample is drawn -- are refusal bias
and attrition bias. Refusal biases may occur
whenever a study population is given the option to refuse
to participate in some or all of the data collection
activities. The ethics of evaluation (and the rules
governing all Federally funded research) require
that subjects be given the option to refuse
cooperation, to terminate cooperation at any time, and to
refuse to answer individual questions. Although
careful attention to the rights and concerns of study
participants can reduce refusal, some refusal is
inevitable in most evaluation data collection. A
common example of refusal bias is
non-return of mail surveys. The question then arises:
"How are subjects that refused to provide data
different from those who did not refuse?" That is,
what biases are introduced by refusal?

There are a number of important ways in which 
those who refuse to cooperate may differ from 
those who do not. For example, it might be 
assumed that the patients most likely to refuse to 
participate in a survey of the incidence of alcohol 
and other drug problems in a given clinic
population are those with the most serious alcohol or other
drug problems. Estimates of problem rates will be
biased if those with the worst problems refuse to
cooperate. A completely different sort of bias
might be introduced if those with the least
education refuse to cooperate to avoid the
embarrassment of admitting that they cannot read or do not
understand the data collection instrument.

Refusal biases may also arise in system level
studies. For example, one or more departments
might refuse to participate in a study of
departmental support for the introduction of substance
abuse-related curricular content. Is such a refusal the
result of a desire to prevent the discovery of
existing problems? Alternately, does the
department chair believe that the department is already
doing an adequate job in this area and that no
benefit will derive to the department from the
study? In either case, the study would be biased by
the lack of data from the non-cooperative
department(s).

Related to refusal biases are attrition biases --
those biases that arise when study participants do
not complete the study. Like refusal, attrition
raises the concern that those who do not complete
the study may be systematically different from
those that do. So, for example, a learner
satisfaction survey taken at the end of a course may show
inflated satisfaction because it does not include
those highly dissatisfied individuals who dropped
out. Similarly, a comparison might be made of
lifestyle changes observed in patients counseled by
clinicians who have or have not been exposed to an
educational innovation. This comparison will be
biased if attrition of heavy drug or alcohol users is
differential between the two clinician groups.



Assessment of the impact of refusal or attrition
biases generally relies on comparisons of the
characteristics of cases from which complete data are
available and those cases from which they are not.
Of course, such a strategy requires some minimum
level of data on lost participants. Thus, even
patients who refuse to complete a questionnaire or
interview might be asked if they would be willing
to provide basic demographic information. Sometimes,
assessment of the effects of refusal or attrition
biases must be based on very rudimentary data
such as race, sex, and approximate age. Such data
might be gathered from case records or even through
observation.

Measurement Biases: Accurate generalizations
are possible only to the extent that accurate measures
are used in the evaluation. In general, the
external validity of evaluation studies is threatened
when measures either fail to accurately represent
the phenomenon of interest (so-called failures of
construct validity), or systematically over- or
underestimate the phenomenon of interest. Of
course, measures may also be "noisy" or unreliable.
However, low reliability is generally considered
to be a threat to internal, rather than external,
validity.

The issue of construct validity is a major concern
in evaluations of substance abuse programming.
How does one measure (or even define) such key
constructs as alcoholism, addiction, dependency,
denial, social drinking, and so on? To say that a
given intervention is successful in identifying "social
drinkers" but not alcoholics requires a consensus
about what is meant by these terms, and a consensus
that the measurement techniques used to assess
them are adequate representations.

Even as simple a construct as whether or not a
learner completed a course raises definitional and
measurement issues. Does the learner have to
attend all sessions? If not, how many may he or she
miss? Does the learner need to do all the required
reading in order to be categorized as a course
completer?

Assuming that a satisfactory definition of a given
variable is available, and assuming that one can
adequately measure the variable in some practical
way, concern may still be raised about systematic
over- or underestimation. For example, if a learner's
diagnostic classifications are being compared to
the results of a written assessment tool, does the
assessment tool over- or underestimate the number
of patients with a given diagnosis?

Issues Related to Internal Validity

Internal validity refers to the extent to which a
given evaluation design allows one to rule out
alternative explanations for the effects observed.
In general, then, internal validity is of concern
when one is attempting to derive causal relationships
among variables (e.g., participation in a
given practicum caused improvements in clinical
skills). The various threats to internal validity may
be defined in terms of various challenges to the
assertion that "exposure to X caused effect Y."
Discussions of the common threats to internal
validity follow.⁶

History: Historical threats to internal validity refer
to specific events or conditions -- in addition to the
treatment or experimental influence -- which occur
between first and subsequent measurements of the
dependent or outcome variable, and which might
influence the magnitude of observed differences.
For example, an observed change in attitudes
regarding alcohol and other drug abusers might be
taken as evidence of the success of an educational
intervention. However, such changes might actually
have resulted from a widely publicized incident
such as the overdose death of a famous athlete
or a major oil spill attributed to alcohol abuse.
Similarly, increased identification of drug abusers
in the emergency room might result from the
increased diagnostic skills of E.R. staff, but might
also be attributable to actual increases in the number
of drug abusers who are present at the E.R. during
the study period.

Maturation: Maturation refers to changes in study
participants between pretest and subsequent testing
which influence the observed outcome, but are
not a part of the program or treatment of interest.
Participants may grow older, wiser, stronger, or
undergo biological change.

For example, it is well known that drug and alcohol
use generally increases throughout the teenage
years, peaks in early adulthood, and then declines.
Thus, depending upon the age of patients in a
study, use rates may either increase or decrease
over time independent of the exposure to any
clinical intervention. Similarly, most psychopathology
(including alcohol and other drug problems)
is associated with a "spontaneous recovery
rate" -- i.e., some individuals will improve in the
absence of any treatment.

Testing: Sometimes, simply taking a test affects
a participant's subsequent performance on the same
test. In pretest/post-test designs, the concern is
with the influence of the pretest experience on the
post-test score. For example, pretesting learners'
attitudes concerning alcohol and other drug abusers
may sensitize learners to the fact that attitude
change is desired. The potential influence of testing
is even more likely in time series or repeated
measurements designs where multiple testing is
required.

Instrumentation: Instrumentation artifacts arise
when changes occur in measurement instruments
or procedures over repeated observations. For
example, measures derived from observations of
clinician-patient encounters may change in some
unknown way as the study progresses and repeated
observations are made. Observers may become
more skilled, bored, or inferential as they rate
samples of clinician behavior. Similarly, if measures
are based on interview data, the quality of the
data may change as interviewers gain experience
and confidence, become more familiar with the
interview schedule, or gain insight into the content
of the interviews.

"Objective" data sources are also subject to instru- 
mentation artifacts. For example, patient record 
keeping systems may be altered or improved, new 
or refined diagnostic categories might be added to
standardized nosologies, laboratory tests may 
become more sensitive, and so on. 

Statistical Regression: Statistical regression artifacts
may arise when study participants are classified
into or selected from extreme groups on the
basis of pretest scores, correlates of pretest scores,
or some other basis. When participants are assigned
to groups on the basis of high or low pretest
scores, the high groups will tend to score lower,
and the low groups higher, at the post-test. For
example, if patients are assigned to a specialized
intervention based on the severity of their alcohol
or other drug problems, these individuals will tend
to exhibit fewer problems in the future due simply
to statistical regression. Similarly, the often observed
effect in psychotherapy that very disturbed
individuals tend to improve more than less disturbed
individuals is probably due to statistical
regression effects rather than to a relationship
between the extent of psychopathology and the
effectiveness of psychotherapy.

Statistical regression effects are especially likely if
the pre- and post-tests are unreliable, or include a
considerable amount of measurement error. Under
these conditions, changes from pre- to post-test
are very likely due to statistical regression, and
attributing such changes to program influences
would almost certainly be incorrect.
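
Statistical regression can be demonstrated with a short simulation.
The sketch below is illustrative only: the score scale, the amount of
measurement error, the 10 percent selection rule, and the function name
regression_demo are all invented assumptions. No treatment of any kind
is applied; each patient's true severity is identical at both testings.

```python
import random
from statistics import mean


def regression_demo(n=2000, noise_sd=10.0, seed=1):
    """Select the highest pretest scorers and compare their two test means."""
    rng = random.Random(seed)
    # Hypothetical true severity, which does NOT change between testings.
    true_severity = [rng.gauss(50.0, 10.0) for _ in range(n)]
    # Each testing adds independent measurement error to the true score.
    pretest = [t + rng.gauss(0.0, noise_sd) for t in true_severity]
    posttest = [t + rng.gauss(0.0, noise_sd) for t in true_severity]
    # Assign the most severe 10 percent (by pretest) to the "intervention".
    cutoff = sorted(pretest)[int(0.9 * n)]
    selected = [i for i in range(n) if pretest[i] >= cutoff]
    pre_mean = mean(pretest[i] for i in selected)
    post_mean = mean(posttest[i] for i in selected)
    return pre_mean, post_mean


noisy = regression_demo(noise_sd=10.0)    # unreliable measure
perfect = regression_demo(noise_sd=0.0)   # error-free measure
print("unreliable measure: pre %.1f, post %.1f" % noisy)
print("error-free measure: pre %.1f, post %.1f" % perfect)
```

With the unreliable measure, the selected group appears to improve even
though nothing about the patients changed; with the error-free measure
the apparent improvement disappears.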






Selection: As used here, selection refers to a
treatment effect due to a lack of initial equivalence
between study groups. For example, the effects of
an innovative pedagogical strategy might be tested
by recruiting volunteers for the new strategy from
all learners enrolled in a particular course. Under
these conditions, differences observed between the
two groups of learners might be due to initial
differences between those who volunteer and those
who do not, rather than to any particular benefit of
the new strategy.

Mortality: Mortality artifacts arise when participants
drop out as the evaluation progresses from
inception to completion. Here, differences between
pretest assessments and subsequent assessments
may reflect changes in the study population
rather than any effects of the intervention.

In comparisons among strategies, there is always a
concern that mortality will be differential -- that is,
different types of participants will be lost in the
different treatment conditions. For example, a
comparison might be made between two methods
for improving learner attitudes towards abusers.
The first might involve a series of lectures, while
the second might include lectures plus guided
discussion of the learners' personal experiences
with alcohol and other drug abuse and with abusers.
The latter strategy might cause learners with
the most negative attitudes or most serious personal
problems sufficient discomfort that they
drop out of the course. Thus, differential changes
in attitudes in the two groups would be observed,
but these changes could not reasonably be attributed
to differences between the two educational
methods.
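
The lecture versus lecture-plus-discussion example can be sketched in
code. Everything here is an invented assumption made for illustration:
the attitude scale, the rule that the most negative fifth of the
discussion group drops out, and the function name
differential_mortality_demo. Neither method changes anyone's attitude
in this simulation.

```python
import random
from statistics import mean


def differential_mortality_demo(n_per_group=500, seed=2):
    """Show how dropout alone can produce an apparent attitude change."""
    rng = random.Random(seed)
    report = {}
    for method, loses_most_negative in (
        ("lectures", False),
        ("lectures_plus_discussion", True),
    ):
        # Hypothetical pretest attitudes (higher = more positive).
        pre = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        post = list(pre)  # assumption: no true change under either method
        if loses_most_negative:
            # The most negative 20 percent leave before the post-test.
            cutoff = sorted(pre)[n_per_group // 5]
            stayers = [i for i in range(n_per_group) if pre[i] >= cutoff]
        else:
            stayers = list(range(n_per_group))
        post_mean = mean(post[i] for i in stayers)
        report[method] = {
            # Post-test completers compared with ALL pretested learners.
            "naive_change": post_mean - mean(pre),
            # Post-test completers compared with their own pretests.
            "completer_change": post_mean - mean(pre[i] for i in stayers),
        }
    return report


for method, changes in differential_mortality_demo().items():
    print(method, changes)
```

The naive comparison shows an "improvement" only in the group that lost
its most negative members. Restricting the comparison to completers
removes that artifact here, although it cannot by itself recover what
would have happened to the learners who dropped out.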

Interactions with Selection: When cases are not
randomly assigned to conditions, the possibility
exists that certain characteristics of study groups
will interact with other threats to internal validity
and produce additional spurious treatment effects.
For example, the influence of maturation on a
treatment group might be different from that on a
comparison group if the treatment and comparison
groups are not initially equivalent. This might be
the case, for example, if patients exposed to a
treatment are drawn from two clinics that differ in
their age distributions.

Ambiguity About the Direction of Causal Influence:
The direction of causal influence is a threat
to internal validity in correlational studies, especially
when an equally plausible argument can be
developed for the conclusion that A causes B, or B
causes A. For example, a study might assess the
association between learner satisfaction with a
course and the degree to which the learners apply
the course content in clinical practice. One might
conclude from such a study that learners who liked
the course are more likely to apply it than learners
who did not. Equally plausible, however, is the
conclusion that learners who apply the content are
more likely to report that they liked the course than
are learners who do not apply the content.

COMPARISON AND CONTROL GROUPS 

Imagine the following scenario: 

Learners entering an innovative clinical practicum
are given a battery of assessments, including alcohol
and other drug knowledge and attitudes, accuracy
in diagnosing dependency, and frequency of
referrals of patients to treatment. Following the
practicum, the assessment battery is repeated.
Comparisons of the test scores prior to and after the
practicum reveal that knowledge and attitudes
improved, diagnostic accuracy remained about the
same, and frequency of referrals went down.

What can be concluded about the effects of the
practicum from the results obtained? Based on the
material discussed thus far, the answer is: "Not
much!"
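
A small simulation, offered as a sketch under invented assumptions,
shows why the pre/post scores alone say so little. Here a hypothetical
outside event (a history effect) raises every learner's knowledge score
by about five points whether or not the practicum has any effect, and
half of the learners are randomly assigned to take the practicum. The
score scale, effect sizes, and the function name practicum_study are
illustrative, not taken from any real evaluation.

```python
import random
from statistics import mean


def practicum_study(program_effect, history_effect=5.0, n=400, seed=3):
    """Return mean pre-to-post gains for practicum and non-practicum learners."""
    rng = random.Random(seed)
    learners = list(range(n))
    rng.shuffle(learners)
    treated = set(learners[: n // 2])  # random assignment to the practicum
    pre = [rng.gauss(60.0, 8.0) for _ in range(n)]
    post = [
        pre[i]
        + history_effect                             # affects everyone
        + (program_effect if i in treated else 0.0)  # practicum only
        + rng.gauss(0.0, 2.0)                        # measurement noise
        for i in range(n)
    ]
    gain_treated = mean(post[i] - pre[i] for i in treated)
    gain_untreated = mean(post[i] - pre[i] for i in range(n) if i not in treated)
    return gain_treated, gain_untreated


for effect in (0.0, 3.0):
    treated_gain, untreated_gain = practicum_study(program_effect=effect)
    print("true effect %.1f: practicum gain %.2f, other learners' gain %.2f"
          % (effect, treated_gain, untreated_gain))
```

Practicum participants gain about five points even when the true
program effect is zero, so their gain by itself cannot distinguish a
useless practicum from a useful one. Only the difference between the
two groups' gains tracks the program effect.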

The reason that not much can be concluded from
this hypothetical study is that there is no basis for
ruling out a variety of other explanations of the
observed results -- that is, no way of assessing
threats to internal validity that may have caused the
observed changes (or lack of change) in the absence
of any program effects. For example, the
observed improvement in knowledge and attitudes
might be a result of the learners' exposure to articles
appearing in widely read journals, while the
observed reductions in referrals might result from
the fact that a major treatment facility stopped
accepting new patients (both are historical artifacts).
Even the lack of change in diagnostic accuracy is
suspect -- perhaps diagnostic accuracy would have
gotten worse without exposure to the practicum as
learners were assigned increasingly more difficult
patients (an instrumentation artifact).

The single major solution to the problems raised by
threats to internal validity is the use of control or
comparison groups -- that is, the testing of individuals
who are identical to the treatment population
in every way, except for exposure to the
treatment. The results obtained for the control or
comparison group are those which are attributable
to history, maturation, testing effects, etc. The
residual differences observed in the treatment group
are those attributable to the treatment. So, for
example, if a control group were used in the study
discussed, and if these individuals showed decreased
diagnostic accuracy over the study period,
the practicum would be deemed successful by
virtue of having stemmed this rate of decrease.

The difference between a control and a comparison
group is that control groups are developed by
randomly assigning cases to treatment, while
comparison groups are developed in any other
way. The great majority of evaluation studies rely
on comparison groups rather than randomly assigned
controls. Unfortunately, the ability of
comparison groups to rule out certain threats to
internal validity is often weak, and statistical methods
for adjusting initial group differences, although
commonly employed, can never approximate
the inferential power that derives from true
control groups. On the other hand, a non-randomly
created comparison group is much better
than no comparison at all, and may be the only
practical option in many evaluations.

A major roadblock to the use of either control or
comparison groups is the fact that, for some individuals,
the opportunity to participate in a given
program (i.e., the one under study) must be withheld.
In studies of patient outcomes, withholding
a treatment seen as potentially beneficial may
cause ethical or legal problems. However, a number
of options exist for constructing control and/or
comparison groups which may be acceptable and
feasible within a given setting. These are discussed
below.

The term "control group" is used throughout the
discussion that follows to refer to both control
(randomly created) and comparison (non-randomly
created) groups. Where no distinction is made,
either a random or non-random assignment strategy
is possible -- i.e., either a control or comparison
group is possible. In cases where only a non-random
assignment strategy is possible, the term
"comparison group" is used.

The True No-Treatment Control

In some evaluations, the most desirable control
condition is one in which some cases simply do not
receive the treatment under study. For example,
participants in an innovative substance abuse practicum
might be randomly selected from among
those learners who express an interest in participating.
Those not selected to participate serve as the
control group. Such a strategy is often justifiable
on practical grounds when the number of interested
learners exceeds the number of available slots.

The Waiting List Control

The waiting list control is implemented by assigning
some cases to receive the program now and
some cases to receive the program later. All cases
are pretested before the "now" group receives the
program and are post-tested after the "now" group
completes the program. The "later" group thus
serves as a non-treated control during the waiting
period. As is the case with the no-treatment
control, the waiting list control provides a practical
solution when demand for a program exceeds
supply. The waiting list control has the disadvantage
that the waiting list may not truly provide a
"no treatment" comparison in that individuals may
receive some "stop gap" services. In addition, it
precludes any long-term comparisons. As soon as
those on the waiting list enter the program, all
comparisons between the groups are nullified.

The "Traditional Treatment" Control

The "traditional treatment" control compares an
innovative strategy to an existing strategy (traditional
treatment). For example, an innovative
training program might be compared to an existing
training program. Although this design does not
allow an assessment of the absolute effects of
either the "traditional" or the innovative strategy,
an assessment of the relative effectiveness is available.

The Minimal Treatment Control

The minimal treatment control is one in which
minimally acceptable treatment (defined by law,
licensing guidelines, program standards, or ethical
considerations) is used as a comparison. For
example, in-depth anticipatory guidance for parents
of adolescents could be compared to a condition
in which parent education pamphlets are simply
available in a clinic waiting room.

The Placebo Control

When a control treatment is so minimal that it is
expected to have no effect whatever, it is referred
to as a placebo control. Sometimes, placebo controls
are used to determine the potential operation
of so-called "Hawthorne Effects" -- that is, the
effects of simply paying attention to study participants.

The Non-Eligible Comparison Group

Some evaluation studies use individuals who are
not, for one reason or another, eligible to receive
the treatment as a comparison group. The term
"comparison group" is used here because assignment
to such a group is necessarily non-random.
For example, learners whose schedules preclude
them from taking an elective course on patient
interviewing might constitute a group to which
learners who take the course may be compared.
This strategy presents a number of conceptual
problems since the criteria that determine eligibility
or non-eligibility may be related in a variety of
ways to the outcomes of interest.

MEASURING OUTCOMES

Faculty development and clinical training programs
seek to attain a variety of outcomes, including
improvements in knowledge, attitudes, and
clinical skills, organizational change, and improvements
in patient outcomes. The specific outcomes
measured in an evaluation of any aspect of
a faculty development or clinical training program
will be determined by three factors: 1) the specific
objectives of the program, 2) the questions the
evaluation is designed to answer, and 3) the scope
and time frame of the evaluation effort.

The relationship between program objectives and
program outcome measures cannot be emphasized
too strongly. Perhaps the most common error in all
outcome evaluations is the use of measures that do
not relate directly to what the program is attempting
to accomplish. For example, improved awareness
of learners' own use of alcohol and other
drugs is a worthwhile goal, and the evaluator may
be motivated to measure changes in this area as
part of the outcome evaluation of an educational
innovation. But if the program is not specifically
designed to increase awareness (for example, if the
innovation is designed to teach pharmacology), the
choice to measure awareness may result in an
artificial finding of program ineffectiveness -- i.e.,
awareness does not increase because there is no
particular reason why it should.



Evaluators must also consider the use to which
outcome information is to be put. In some cases,
such as a summative evaluation of a well established
program or program component, highly
precise (and probably expensive) measures of
outcome may be required. Indeed, as discussed
below, multiple, independent outcome measures
would be appropriate in such an evaluation. In
other cases, however, less precise (and less expensive)
measures may be sufficient. For example, a
preliminary assessment of the effectiveness of an
innovative educational strategy might be accomplished
by asking for student feedback about whether
the innovation seemed useful and appropriate.
Such measures may be all that is required to
evaluate whether or not the program element is
worth developing further.

Finally, the scope and the time frame of the evaluation
must be considered. For example, long-term
patient outcomes might be a very important indicator
of the success of an effort to improve clinical
skills. However, such an assessment may require
studying large numbers of patients or following
patients for significant periods of time. This is
especially the case if the incidence of alcohol and
other drug problems in the patient population is
low (e.g., adolescents or pre-adolescents).

It is also the case that many outcome measures
(e.g., the percentage of patients for whom alcohol
and drug histories are taken) might show significant
changes immediately following exposure to a
seminar on history taking. However, these changes
may decay within a short period of time. Therefore,
if a reasonable follow-up period cannot be
accommodated, the evaluator may choose to deemphasize
such measures in order to avoid basing
program decisions on highly unstable outcomes.

Generally, the outcomes of faculty development
and clinical training programs are measured in one
of four ways: 1) self-reports, 2) other-reports, 3)
observation of behavior, and 4) review of records.
The strengths and weaknesses of each of these
measurement techniques follow.

Before proceeding to a discussion of each of these
types of outcome measures, it is important to note
that the most conclusive evaluations of outcome
rarely rely on a single type of outcome measure.
This is because all outcome measures have weaknesses
that limit the strength of conclusions based
on them. Thus, when multiple measures are employed,
the strengths of one measure may balance
the weaknesses of another. Moreover, to the extent
that the various measures coincide (i.e., suggest
the same conclusion), the inferential strength of
the evaluation increases. The use of multiple measures
is an example of "triangulation" in social
science research.⁷

Self-Report Measures 

Self-report measures include those in which respondents
are asked to assess the quantity or quality
of their own behavior, or to provide a variety of
other information that may be used to assess knowledge,
attitudes, demographics, and so on. The
usual methods for gathering self-report data are
paper-and-pencil questionnaires and interviews.

The major advantages of self-report data are their
ease of collection and relatively low cost, and their
generally high reliability -- i.e., freedom from
internal measurement error. In addition, self-reports
are the only practical way to assess certain
outcomes -- such as changes in attitudes -- which
are not directly observable.

The major disadvantage of self-reports is their
questionable validity -- i.e., the extent to which
they provide accurate information. The validity of
self-reports is the topic of considerable debate,
especially when used to assess value-laden behaviors.
However, self-reports constitute a useful and
acceptably valid measurement methodology if a
number of precautions are taken.

First, the confidentiality of responses must be
closely guarded, and the procedures for protecting
confidentiality emphasized to respondents. Second,
the importance of the data collection and the
respondents' role in the data collection should be
emphasized. If the respondent feels that he or she
is a key "partner" in the evaluation effort, the
probability of providing valid data will be increased.






Finally, self-report data collection instruments
should be carefully pilot-tested on respondents
similar or identical to those who will participate in
the evaluation. As part of the pilot-testing procedure,
respondents should be debriefed concerning
their willingness to answer all sensitive items truthfully,
and suggestions should be solicited for ways
that offensive items may be improved. This last
may sound simplistic, but experience suggests that
pilot respondents are a key source of ideas for ways
to improve the validity of self-reports.

Other-Reports

Other-reports are identical in all respects to self-reports,
except that data are gathered about a target
individual from someone other than the target
himself or herself.

One source of other-reports is individuals who
have no stake in the performance of the target
individual. Such other-reports may be particularly
useful in assessing changes in clinical practice
behavior. For example, rather than relying entirely
on clinicians' reports regarding whether or not
they questioned patients about alcohol and other
drug use, the evaluator may conduct exit interviews
with patients.
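
As a rough sketch of how paired clinician reports and patient exit
interviews might be compared, the following tabulates agreement for the
same set of encounters. The function name compare_reports and the eight
example encounters are hypothetical and are used only to show the
arithmetic.

```python
def compare_reports(clinician_says_asked, patient_says_asked):
    """Cross-check paired yes/no reports about the same encounters."""
    if len(clinician_says_asked) != len(patient_says_asked):
        raise ValueError("reports must be paired by encounter")
    pairs = list(zip(clinician_says_asked, patient_says_asked))
    n = len(pairs)
    return {
        # Share of encounters where both sources give the same answer.
        "agreement_rate": sum(1 for c, p in pairs if c == p) / n,
        # Share of encounters in which each source says the question was asked.
        "clinician_rate": sum(1 for c, _ in pairs if c) / n,
        "patient_rate": sum(1 for _, p in pairs if p) / n,
        # Encounters where only the clinician reports having asked.
        "clinician_only": sum(1 for c, p in pairs if c and not p),
    }


# Hypothetical data: True means "alcohol and other drug use was asked about".
clinician = [True, True, True, True, False, True, True, False]
patient = [True, False, True, True, False, False, True, False]
print(compare_reports(clinician, patient))
```

A clinician rate well above the patient rate would suggest that
self-reports overstate the behavior, although patients' recall is also
fallible and neither source should be treated as the true value.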

Other-reports may also be gathered from individuals
who know the target individual well (spouse,
other relative, close friend, etc.). Such reports may
be useful in studies of patient outcomes, where
patient self-reports may provide biased estimates
of alcohol or other drug use.

Behavioral Observations 

Observations may be designed to measure some of
the same behaviors assessed through self-reports
or other-reports. However, observational data collection
relies on the ratings or assessments of
trained and unbiased observers. For example, observations
of provider/patient interactions can be
used to assess interviewing or counseling skills, or
to assess patient reactions to the practitioner. Similarly,
observation of a lecture or seminar can assess
the skill of the instructor, the extent to which the
sessions accomplish the objectives set forth in a
curriculum guide, the nature of learner interaction,
and so on. The major disadvantage of observational
techniques is that the presence of the observer
may change or inhibit the behavior of those
being observed. Thus, the use of observers is most
advantageous when it is unobtrusive.

One variant of observational data collection which
overcomes some of its disadvantages is the use of 
simulated or standardized patients. As discussed 
earlier, this technique employs actors trained to 
present a standardized complaint. Depending on 
the evaluation design employed, the clinicians 
being assessed may either know that they are 
interacting with a simulated or standardized pa- 
tient, or they may simply be told that such patients 
will be intermingled with actual patients seen by 
the clinicians. 

Records Data 

Health and mental health settings keep a variety of 
records on patients which may be used to assess the
outcomes of faculty development and clinical train- 
ing programs. Admissions records, patient charts, 
and insurance claims records all can be used to 
indicate the extent to which alcohol and other drug 
issues are being addressed with patients. 

For example, the International Classification of
Diseases, 9th Edition, Clinical Modification (ICD-9-CM)
is one system used in many medical settings
for the preparation of third-party reimbursement
documentation or in medical record keeping. ICD-9-CM
provides a standardized, detailed classification
of mortality and morbidity information, as
well as classification of diagnostic, therapeutic,
and surgical procedures. The ICD-9 diagnostic
codes relating to substance abuse cover a wide
range of morbidity. Separate codes are allotted to
abuse and dependence, and sub-codes within these
classifications include alcohol abuse and dependence
[...]tions of drugs including tobacco and various
polydrug combinations. Separate categories are allotted
to acute intoxication, withdrawal, medical and
psychiatric sequelae, fetal complications, and personal
history of alcoholism. Still other codes are
used to indicate substance abuse as a contributing
factor to other morbidity. ICD-9 procedure codes
related to substance abuse include alcoholism and
drug counseling, and referral to alcoholism and
drug addiction rehabilitation.
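
A minimal sketch of how coded records might be summarized follows. It
assumes that each patient record is available as a list of ICD-9-CM
diagnosis code strings, and it counts only the three-digit categories
303 (alcohol dependence syndrome), 304 (drug dependence), and 305
(nondependent abuse of drugs). The function name
substance_diagnosis_rate and the sample records are hypothetical; an
actual evaluation would need the full set of relevant codes and
attention to how the records were produced.

```python
# Three-digit ICD-9-CM categories counted as substance-related in this sketch.
SUBSTANCE_CATEGORIES = {
    "303": "alcohol dependence syndrome",
    "304": "drug dependence",
    "305": "nondependent abuse of drugs",
}


def substance_diagnosis_rate(patient_records):
    """Share of records carrying at least one substance-related code."""
    if not patient_records:
        return 0.0
    flagged = 0
    for codes in patient_records:
        # Reduce each code (e.g. "305.00") to its three-digit category.
        categories = {code.split(".")[0] for code in codes}
        if categories.intersection(SUBSTANCE_CATEGORIES):
            flagged += 1
    return flagged / len(patient_records)


# Hypothetical records sampled before and after a training program.
before_training = [["401.9"], ["305.00", "250.00"], ["486"], ["530.81"]]
after_training = [["303.90"], ["401.9"], ["305.00"], ["486", "304.20"]]
print(substance_diagnosis_rate(before_training))
print(substance_diagnosis_rate(after_training))
```

A rise in this rate could reflect better identification by clinicians,
but it could equally reflect a change in coding practice or in the
patients seen, so it is best read alongside other measures.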

A general disadvantage of records data is that they
are usually gathered for purposes other than evaluation.
Thus, the data they contain may be incomplete
or inaccurate, or the record keeping methods
may change during the course of an evaluation.
Insurance records appear to be more carefully
maintained than other medical records in terms of
fiscal information, but are probably subject to the
same reporting errors and biases as other patient
records.

QUALITATIVE EVALUATION METHODS 

Data Collection Techniques 

As discussed previously, certain types of evalu- 
ation questions, especially more general, explora- 
tory questions, are most appropriately evaluated 
using qualitative methods. The qualitative meth- 
ods most often used are observation, participant 
observation, and qualitative interviewing. 

Observation. Behavioral observation was discussed
in the previous section. Such observation
can be highly structured and quantified. For example,
observers might count the number of times
per minute a clinician makes eye contact with a
patient in order to test hypotheses concerning the
quality of clinician-patient interaction. On the
other hand, if the evaluator has not developed
specific hypotheses, more general observations
might be carried out. The evaluator might want to
observe clinician-patient interactions as a way of
developing a categorization of different types of
intervention approaches or different types of patient
reactions to interventions. Or the evaluator
might want to learn something about the general
tone of interactions or get an idea if and how
learners are putting into practice the techniques
suggested in training. Here, a structured observation
protocol listing the general issues to be addressed
in the observation sessions could be developed
to guide the data collection.

Participant Observation. The examples of observation
techniques described above relate to observers
who sit quietly in the background and do
not become involved. Participant observers, on
the other hand, become immersed in the situation,
experiencing it, at least to some extent, as an actual
participant would. For example, the participant
observer might actually go through an educational
intervention along with the other learners in order
to experience first hand what the learning experience
is like and to hear the reactions of other
participants. Such observations would be useful in
evaluating the quality of implementation of an
intervention, developing hypotheses about the
outcome of the intervention, etc.

Observation and participant observation can be 
impressionistic and even casual in initial explora- 
tory phases of evaluation. However, these obser- 
vation techniques can also be quite systematic, 
with the periods of evaluation precisely selected 
and the observations carefully documented. The 
usual form of documentation is detailed field notes 
describing the setting, the participants, conversa- 
tions and events, as well as the observer's reac- 
tions. The notes should be written immediately 
after the event. Ample time should be allocated to 
the preparation of notes; it may take as much time 
to write the notes as it did to observe the events. 

Qualitative Interviews. Qualitative interviews seek 
in-depth data from the respondent's perspective. 
The interviewer usually works from a list of gen- 
eral topic areas to be covered, but the questions 
used are open-ended. The interviewer does not 
structure the responses or constrain the informa- 
tion the respondent provides. 






QUALITATIVE ANALYSIS



The purpose of qualitative research is to understand
rather than to enumerate. Therefore, data collected
using qualitative techniques yield a narrative analysis
that can take a number of forms. For example,
responses to an open-ended question can be reproduced
entirely, without comment. Responses can be
organized by topic (e.g., all comments dealing with
learners' apprehensiveness about discussing alcohol
and other drug issues with patients). Qualitative
data also lend themselves very well to the development of
descriptive case studies. When more than one case
is described, they can be compared and contrasted.
Frequently, the results of a qualitative study lead to
more specific hypotheses and typologies that can be
tested using quantitative data collection and analysis.
The qualitative data can then be used to enrich
quantitative reports with illustrations and examples.



CONCLUSION 



The recent interest in further integrating alcohol
and other drug issues into the professional training
of health and mental health professionals promises
to improve the health care system's response to a
major cause of mortality and morbidity. Careful
evaluations of faculty development and clinical
training programs in alcohol and other drug abuse
can help ensure that the growth of these programs
is planful and rational, and that the strategies
shown to be successful in one institution can be
exported to others.

It is our hope that the materials presented in this
Guide Book contribute to an understanding of
what evaluations can and cannot accomplish,
and that interested readers will examine the
sourcebooks we have recommended in their
further explorations of evaluation issues.



NOTES 



¹ Adapted from French, J. and Kaufman, N. (Eds.),
Handbook of Prevention Evaluation: Prevention
Evaluation Guidelines. Rockville, MD: National
Institute on Drug Abuse, 1981.

² Kish, L.A. Survey Sampling. New York: Wiley,
1967.

³ Cohen, J. Statistical Power Analysis for the Behavioral
Sciences. New York: Academic Press, 1969;
Cohen, J. and Cohen, P. Applied Multiple Regression
Analysis for the Behavioral Sciences. New York:
John Wiley and Sons, 1975.

⁴ Ibid.

⁵ Ibid.

⁶ Adapted from French, J. and Kaufman, N.
(Eds.), Handbook of Prevention Evaluation:
Prevention Evaluation Guidelines. Rockville,
MD: National Institute on Drug Abuse, 1981.

⁷ Webb, E. Unconventionality, triangulation, and
inference. In N. Denzin (Ed.), Sociological
Methods: A Sourcebook. New York: Aldine
Press, 1970.






